Abstract. Bounds on the risk play a crucial role in statistical learning theory.
They usually involve, as capacity measure of the model studied, the VC dimension or
one of its extensions. In classification, such "VC dimensions" exist for models taking
values in $\{0, 1\}$, $\{1, \ldots, Q\}$ and $\mathbb{R}$. We introduce the generalizations appropriate
for the missing case, the one of models with values in $\mathbb{R}^Q$. This provides us with a
new guaranteed risk for M-SVMs which appears superior to the existing one.
Keywords: Large margin classifiers, Generalized VC dimensions, M-SVMs.



1 Introduction 



Vapnik's statistical learning theory [Vapnik, 1998] deals with three types of
problems: pattern recognition, regression estimation and density estimation.
However, the theory of bounds has primarily been developed for the
computation of dichotomies only. Central in this theory is the notion of
"capacity" of classes of functions. In the case of binary classifiers, the
measure of this capacity is the famous Vapnik-Chervonenkis (VC) dimension.
Extensions have also been proposed for real-valued bi-class models and
multi-class models taking their values in the set of categories. Strangely
enough, no generalized VC dimension was available so far for Q-category
classifiers taking their values in $\mathbb{R}^Q$. This was all the more
unsatisfactory as many classifiers exhibit this property, such as the
multi-layer perceptrons, or the multi-class support vector machines (M-SVMs).
In this paper, the scale-sensitive $\Psi$-dimensions are introduced to fill
this gap. A generalization of Sauer's lemma [Sauer, 1972] is given, which
relates the covering numbers appearing in the standard guaranteed risk for
large margin multi-category discriminant models to one of these dimensions,
the margin Natarajan dimension. This latter dimension is then bounded from
above for the architecture shared by all the M-SVMs proposed so far. This
provides us with a sharper bound on their sample complexity.
The organization of the paper is as follows. Section 2 introduces the basic
bound on the risk of large margin multi-category discriminant models. In
Section 3, the scale-sensitive $\Psi$-dimensions are defined, and the generalized
Sauer lemma is formulated. The upper bound on the margin Natarajan
dimension of the M-SVMs is then described in Section 4. For lack of space,
proofs are omitted. They can be found in [Guermeur, 2004].
2 Basic theory of large margin Q-category classifiers

We consider Q-category pattern recognition problems, with $3 \leq Q < \infty$. A
pattern is represented by its description $x \in \mathcal{X}$ and the set of categories $\mathcal{Y}$
is identified with the set of indices of the categories, $\{1, \ldots, Q\}$. The link
between patterns and categories is supposed to be probabilistic. $\mathcal{X}$ and $\mathcal{Y}$
are probability spaces, and $\mathcal{X} \times \mathcal{Y}$ is endowed with a probability measure P,
fixed but unknown. Let (X, Y) be a random pair distributed according to
P. Training consists in using an m-sample $s_m = ((X_i, Y_i))_{1 \leq i \leq m}$ of
independent copies of (X, Y) to select, in a given class of functions $\mathcal{G}$, a function
classifying data in an optimal way. The criterion to be optimized, the risk,
is the expectation with respect to P of a given loss function. The way the
functions in $\mathcal{G}$ perform classification must be specified. We consider classes
of functions from $\mathcal{X}$ into $\mathbb{R}^Q$. $g = (g_k)_{1 \leq k \leq Q} \in \mathcal{G}$ assigns $x \in \mathcal{X}$ to the
category l if and only if $g_l(x) > \max_{k \neq l} g_k(x)$. Cases of ex aequo are treated
as errors. This calls for the choice of a loss function $\ell$ defined on $\mathcal{G} \times \mathcal{X} \times \mathcal{Y}$
by $\ell(y, g(x)) = 1_{\{g_y(x) \leq \max_{k \neq y} g_k(x)\}}$. The risk of g is then given by:

$R(g) = E[\ell(Y, g(X))] = \int 1_{\{g_y(x) \leq \max_{k \neq y} g_k(x)\}} \, dP(x, y)$.
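
The decision rule and the loss can be illustrated with a minimal Python sketch. The names `predict`, `zero_one_loss` and `empirical_risk` are assumptions made for this example, and g is represented by any callable returning the list of its Q outputs. Ties are counted as errors, as stated above.

```python
def predict(outputs):
    """Return the 1-based index of the strictly largest output, or None on a tie."""
    top = max(outputs)
    winners = [k for k, value in enumerate(outputs, start=1) if value == top]
    return winners[0] if len(winners) == 1 else None


def zero_one_loss(y, outputs):
    """Return 1 when g_y(x) <= max_{k != y} g_k(x), so that ties count as errors."""
    others = [value for k, value in enumerate(outputs, start=1) if k != y]
    return 1 if outputs[y - 1] <= max(others) else 0


def empirical_risk(sample, g):
    """Average loss of the callable g over a list of (x, y) pairs."""
    return sum(zero_one_loss(y, g(x)) for x, y in sample) / len(sample)
```

The empirical average computed by `empirical_risk` is only the sample counterpart of R(g); the risk itself is an expectation with respect to the unknown measure P.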

This study deals with large margin classifiers, when the underlying notion of 
multi-class margin is the following one. 

Definition 1 (Multi-class margin). Let g be a function from $\mathcal{X}$ into $\mathbb{R}^Q$.
Its margin on $(x, y) \in \mathcal{X} \times \mathcal{Y}$, $M_{xy}(g, x, y)$, is given by:

$M_{xy}(g, x, y) = \frac{1}{2} \left( g_y(x) - \max_{k \neq y} g_k(x) \right)$.

Basically, the central elements to assign a pattern to a category and to derive
a level of confidence in this assignment are the index of the highest output
and the difference between this output and the second highest one. The class
of functions of interest is thus the image of $\mathcal{G}$ by application of an appropriate
operator. Two such "margin operators" are considered here, $\Delta$ and $\Delta^*$.

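Definition 2 ($\Delta$ operator). Define $\Delta$ as an operator on $\mathcal{G}$ such that:

$\Delta : \mathcal{G} \longrightarrow \Delta\mathcal{G}$

$g \mapsto \Delta g = (\Delta g_k)_{1 \leq k \leq Q}$

$\forall x \in \mathcal{X}, \; \Delta g(x) = \left( \frac{1}{2} \left( g_k(x) - \max_{l \neq k} g_l(x) \right) \right)_{1 \leq k \leq Q}$.

$\forall (g, x) \in \mathcal{G} \times \mathcal{X}$, let $M_x(g, x) = \max_k \Delta g_k(x)$.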
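Definition 3 ($\Delta^*$ operator). Define $\Delta^*$ as an operator on $\mathcal{G}$ such that:

$\Delta^* : \mathcal{G} \longrightarrow \Delta^*\mathcal{G}$

$g \mapsto \Delta^* g = (\Delta^* g_k)_{1 \leq k \leq Q}$

$\forall x \in \mathcal{X}, \; \Delta^* g(x) = \left( \mathrm{sign}(\Delta g_k(x)) \cdot M_x(g, x) \right)_{1 \leq k \leq Q}$.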
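
The two margin operators can be sketched in a few lines of Python acting on the vector of outputs g(x). This is only an illustration: the function names are assumptions, and the 1/2 normalisation of Definitions 1 and 2 is taken for granted.

```python
def sign(t):
    """Sign of a real number: -1, 0 or 1."""
    return (t > 0) - (t < 0)


def delta(outputs):
    """Delta operator: component k is (g_k(x) - max_{l != k} g_l(x)) / 2."""
    result = []
    for k, value in enumerate(outputs):
        rest = outputs[:k] + outputs[k + 1:]
        result.append(0.5 * (value - max(rest)))
    return result


def delta_star(outputs):
    """Delta* operator: component k is sign(Delta g_k(x)) * max_l Delta g_l(x)."""
    d = delta(outputs)
    largest = max(d)
    return [sign(dk) * largest for dk in d]


def margin(outputs, y):
    """Multi-class margin on (x, y): the component of Delta g(x) with index y."""
    return delta(outputs)[y - 1]
```

For the outputs `[2.0, 0.5, -1.0]`, `delta` gives `[0.75, -0.75, -1.5]` and `delta_star` gives `[0.75, -0.75, -0.75]`: both operators agree on the component of the highest output, while the second one keeps only the sign of the other components.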

In the sequel, $\Delta^\#$ is used in place of $\Delta$ and $\Delta^*$ in the formulas that hold true
for both operators. The empirical margin risk is defined as follows.

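Definition 4 (Margin risk). Let $\gamma \in \mathbb{R}_+^*$. The risk with margin $\gamma$ of g,
$R_\gamma(g)$, and its empirical estimate on $s_m$, $R_{\gamma, s_m}(g)$, are defined as:

$R_\gamma(g) = \int 1_{\{\Delta^\# g_y(x) < \gamma\}} \, dP(x, y), \quad R_{\gamma, s_m}(g) = \frac{1}{m} \sum_{i=1}^{m} 1_{\{\Delta^\# g_{Y_i}(X_i) < \gamma\}}$.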

For technical reasons, it is useful to squash the functions $\Delta^\# g_k$ as much
as possible without altering the value of the empirical margin risk. This is
achieved by application of another operator.

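Definition 5 ($\pi_\gamma$ operator [Bartlett, 1998]). For $\gamma \in \mathbb{R}_+^*$, define $\pi_\gamma$ as
an operator on $\mathcal{G}$ such that:

$\pi_\gamma : \mathcal{G} \longrightarrow \pi_\gamma\mathcal{G}$

$g \mapsto \pi_\gamma g = (\pi_\gamma g_k)_{1 \leq k \leq Q}$

$\forall x \in \mathcal{X}, \; \pi_\gamma g(x) = \left( \mathrm{sign}(g_k(x)) \cdot \min(|g_k(x)|, \gamma) \right)_{1 \leq k \leq Q}$.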
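
The squashing step can be checked numerically with a short Python sketch. It assumes that the empirical margin risk is the fraction of examples whose margin component with index $y_i$ falls strictly below $\gamma$; the function names are illustrative.

```python
def sign(t):
    """Sign of a real number: -1, 0 or 1."""
    return (t > 0) - (t < 0)


def pi_gamma(values, gamma):
    """Clip each component to [-gamma, gamma] while keeping its sign."""
    return [sign(v) * min(abs(v), gamma) for v in values]


def empirical_margin_risk(margin_vectors, labels, gamma):
    """Fraction of examples whose component with index y_i is strictly below gamma.

    margin_vectors[i] is assumed to hold the margin vector of example i,
    and labels[i] its 1-based category.
    """
    errors = sum(1 for vec, y in zip(margin_vectors, labels) if vec[y - 1] < gamma)
    return errors / len(labels)
```

Since clipping maps every value that is at least $\gamma$ to exactly $\gamma$ and leaves smaller values either unchanged or more negative, the set of examples counted by `empirical_margin_risk` is the same before and after applying `pi_gamma`.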

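Let $\Delta^\#_\gamma$ denote $\pi_\gamma \circ \Delta^\#$ and $\Delta^\#_\gamma\mathcal{G}$ be defined as the set of functions $\Delta^\#_\gamma g$.
The capacity of $\Delta^\#_\gamma\mathcal{G}$ is characterized by its covering numbers.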

Definition 6 ($\epsilon$-cover, $\epsilon$-net and covering numbers). Let $(E, \rho)$ be a
pseudo-metric space, $E' \subset E$ and $\epsilon \in \mathbb{R}_+^*$. An $\epsilon$-cover of $E'$ is a coverage of
$E'$ with open balls of radius $\epsilon$ the centers of which belong to E. These centers
form an $\epsilon$-net of $E'$. A proper $\epsilon$-net of $E'$ is an $\epsilon$-net of $E'$ included in $E'$. If
$E'$ has an $\epsilon$-net of finite cardinality, then its covering number $\mathcal{N}(\epsilon, E', \rho)$ is
the smallest cardinality of its $\epsilon$-nets. If there is no such finite cover, then the
covering number is defined to be $\infty$. $\mathcal{N}^{(p)}(\epsilon, E', \rho)$ will designate the covering
number of $E'$ obtained by considering proper $\epsilon$-nets only.

The covering numbers of interest use the following pseudo-metric: 

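Definition 7 (functional pseudo-metric). Let $\mathcal{G}$ be a class of functions
from $\mathcal{X}$ into $\mathbb{R}^Q$. For a set $s_{\mathcal{X}^n} \subset \mathcal{X}$ of cardinality n, define the pseudo-metric
$d_{\ell_\infty, \ell_\infty(s_{\mathcal{X}^n})}$ on $\mathcal{G}$ as:

$\forall (g, g') \in \mathcal{G}^2, \; d_{\ell_\infty, \ell_\infty(s_{\mathcal{X}^n})}(g, g') = \max_{x \in s_{\mathcal{X}^n}} \|g(x) - g'(x)\|_\infty$.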
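
For a finite class of functions, the pseudo-metric and a proper net can be computed directly. The sketch below is an illustration under assumptions: functions are Python callables returning lists of Q reals, and the greedy procedure only yields an upper bound on the proper covering number, not its exact value.

```python
def d_inf_inf(g, g_prime, points):
    """Largest sup-norm gap between g(x) and g_prime(x) over the given points."""
    return max(
        max(abs(a - b) for a, b in zip(g(x), g_prime(x)))
        for x in points
    )


def greedy_proper_net(functions, points, eps):
    """Build a proper eps-net of a finite class with a greedy pass.

    A function becomes a new center only when it lies outside the open
    balls of radius eps around all centers selected so far.
    """
    centers = []
    for f in functions:
        if all(d_inf_inf(f, c, points) >= eps for c in centers):
            centers.append(f)
    return centers
```

Every function of the class is either a center or within distance less than `eps` of an earlier center, so `len(greedy_proper_net(...))` bounds the proper covering number of the class from above for this set of points.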
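Let $\mathcal{N}^{(p)}_{\infty,\infty}(\epsilon, \Delta^\#_\gamma\mathcal{G}, n) = \sup_{s_{\mathcal{X}^n} \subset \mathcal{X}} \mathcal{N}^{(p)}(\epsilon, \Delta^\#_\gamma\mathcal{G}, d_{\ell_\infty, \ell_\infty(s_{\mathcal{X}^n})})$. The following
theorem extends to the multi-class case Corollary 9 in [Bartlett, 1998].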

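Theorem 1 (Theorem 1 in [Guermeur, 2004]). Let $s_m$ be an m-sample
of examples independently drawn from a probability distribution on $\mathcal{X} \times \mathcal{Y}$.
With probability at least $1 - \delta$, for every value of $\gamma$ in (0, 1], the risk of any
function g in a class $\mathcal{G}$ is bounded from above by:

$R(g) \leq R_{\gamma, s_m}(g) + \sqrt{\frac{2}{m} \left( \ln\left( 2 \mathcal{N}^{(p)}_{\infty,\infty}(\gamma/4, \Delta^*_\gamma\mathcal{G}, 2m) \right) + \ln\frac{2}{\gamma\delta} \right)} + \frac{1}{m}. \quad (1)$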



Studying the sample complexity of a classifier $\mathcal{G}$ can thus amount to
computing an upper bound on $\mathcal{N}^{(p)}_{\infty,\infty}(\gamma/4, \Delta^*_\gamma\mathcal{G}, 2m)$. In [Guermeur et al., 2005],
we reached this goal by relating these numbers to the entropy numbers of
the corresponding evaluation operator. In the present paper, we follow the
traditional path of VC bounds, by making use of a generalized VC dimension.

3 Bounding covering numbers in terms of the margin 
Natarajan dimension 

The $\Psi$-dimensions are the generalized VC dimensions that characterize the
learnability of classes of $\{1, \ldots, Q\}$-valued functions.

Definition 8 ($\Psi$-dimensions [Ben-David et al., 1995]). Let $\mathcal{F}$ be a class
of functions on a set $\mathcal{X}$ taking their values in the finite set $\{1, \ldots, Q\}$. Let
$\Psi$ be a set of mappings $\psi$ from $\{1, \ldots, Q\}$ into $\{-1, 1, *\}$, where * is thought
of as a null element. A subset $s_{\mathcal{X}^n} = \{x_i : 1 \leq i \leq n\}$ of $\mathcal{X}$ is said to be
$\Psi$-shattered by $\mathcal{F}$ if there is a mapping $\psi^n = (\psi^{(1)}, \ldots, \psi^{(i)}, \ldots, \psi^{(n)})$ in $\Psi^n$
such that for each vector $v_y$ of $\{-1, 1\}^n$, there is a function $f_y$ in $\mathcal{F}$ satisfying

$\left( \psi^{(i)} \circ f_y(x_i) \right)_{1 \leq i \leq n} = v_y$.

The $\Psi$-dimension of $\mathcal{F}$, denoted by $\Psi$-dim$(\mathcal{F})$, is the maximal cardinality of
a subset of $\mathcal{X}$ $\Psi$-shattered by $\mathcal{F}$, if it is finite, or infinity otherwise.

One of these dimensions needs to be singled out, the Natarajan dimension. 

Definition 9 (Natarajan dimension [Ben-David et al., 1995]). Let $\mathcal{F}$
be a class of functions on a set $\mathcal{X}$ taking their values in $\{1, \ldots, Q\}$. The
Natarajan dimension of $\mathcal{F}$, N-dim$(\mathcal{F})$, is the $\Psi$-dimension of $\mathcal{F}$ in the specific
case where $\Psi$ is the set of $Q(Q - 1)$ mappings $\psi_{k,l}$ ($1 \leq k \neq l \leq Q$), such
that $\psi_{k,l}$ takes the value 1 if its argument is equal to k, the value $-1$ if its
argument is equal to l, and * otherwise.
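
For a finite class on a finite domain, the Natarajan dimension can be computed by exhaustive search. The following Python sketch is a brute-force illustration only (its cost grows exponentially with the number of points); functions are represented as dictionaries from points to categories, and all names are assumptions made for the example.

```python
from itertools import combinations, permutations, product


def psi(k, l):
    """Mapping psi_{k,l}: 1 on k, -1 on l, None (the null element *) elsewhere."""
    return lambda c: 1 if c == k else (-1 if c == l else None)


def is_n_shattered(functions, points, Q):
    """True when some choice of one mapping psi_{k,l} per point shatters the points."""
    pairs = list(permutations(range(1, Q + 1), 2))  # the Q(Q-1) ordered pairs
    n = len(points)
    for choice in product(pairs, repeat=n):
        maps = [psi(k, l) for k, l in choice]
        patterns = {
            tuple(m(f[x]) for m, x in zip(maps, points)) for f in functions
        }
        if all(v in patterns for v in product((-1, 1), repeat=n)):
            return True
    return False


def natarajan_dimension(functions, domain, Q):
    """Largest cardinality of a subset of the finite domain that is N-shattered."""
    best = 0
    for n in range(1, len(domain) + 1):
        if any(is_n_shattered(functions, s, Q) for s in combinations(domain, n)):
            best = n
        else:
            break  # subsets of shattered sets are shattered, so larger sets fail too
    return best
```

With Q = 3 and the four functions mapping two points into $\{1, 2\}$ in all possible ways, the search returns 2, using the mapping $\psi_{1,2}$ on both points.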
The fat-shattering dimension characterizes the uniform Glivenko-Cantelli
classes among the classes of real-valued functions.

Definition 10 (fat-shattering dimension [Alon et al., 1997]). Let $\mathcal{G}$
be a class of functions from $\mathcal{X}$ into $\mathbb{R}$. For $\gamma \in \mathbb{R}_+^*$, $s_{\mathcal{X}^n} = \{x_i : 1 \leq i \leq n\} \subset \mathcal{X}$
is said to be $\gamma$-shattered by $\mathcal{G}$ if there is a vector $v_b = (b_i) \in \mathbb{R}^n$ such that,
for each vector $v_y = (y_i) \in \{-1, 1\}^n$, there is a function $g_y \in \mathcal{G}$ satisfying

$\forall i \in \{1, \ldots, n\}, \; y_i \left( g_y(x_i) - b_i \right) \geq \gamma$.

The fat-shattering dimension of $\mathcal{G}$, $P_\gamma$-dim$(\mathcal{G})$, is the maximal cardinality of
a subset of $\mathcal{X}$ $\gamma$-shattered by $\mathcal{G}$, if it is finite, or infinity otherwise.
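
The shattering condition can be tested mechanically for a finite class once a witness vector is fixed. The Python sketch below is an illustration with assumed names; it checks the condition for one given vector `b`, whereas the definition only asks for the existence of such a vector.

```python
from itertools import product


def is_gamma_shattered(functions, points, b, gamma):
    """Check gamma-shattering of the points by a finite class for the witness b.

    For every sign pattern v in {-1, 1}^n, some function g must satisfy
    v_i * (g(x_i) - b_i) >= gamma at every point x_i.
    """
    for v in product((-1, 1), repeat=len(points)):
        realised = any(
            all(y * (g(x) - bi) >= gamma for y, x, bi in zip(v, points, b))
            for g in functions
        )
        if not realised:
            return False
    return True
```

Raising `gamma` can only make the check harder to pass, which is why the dimension is a non-increasing function of the scale parameter.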

Given the results available for the $\Psi$-dimensions and the fat-shattering
dimension, it appears natural, to study the generalization capabilities of classifiers
taking values in $\mathbb{R}^Q$, to consider the use of capacity measures obtained as
mixtures of the two concepts, namely scale-sensitive $\Psi$-dimensions.

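Definition 11 ($\Psi$-dimension with margin $\gamma$). Let $\mathcal{G}$ be a class of
functions on a set $\mathcal{X}$ taking their values in $\mathbb{R}^Q$. Let $\Psi$ be a family of
mappings $\psi$ from $\{1, \ldots, Q\}$ into $\{-1, 1, *\}$. For $\gamma \in \mathbb{R}_+^*$, a subset
$s_{\mathcal{X}^n} = \{x_i : 1 \leq i \leq n\}$ of $\mathcal{X}$ is said to be $\gamma$-$\Psi$-shattered by $\Delta^\#\mathcal{G}$ if there is a mapping
$\psi^n = (\psi^{(1)}, \ldots, \psi^{(i)}, \ldots, \psi^{(n)})$ in $\Psi^n$ and a vector $v_b = (b_i)$ in $\mathbb{R}^n$ such that,
for each vector $v_y = (y_i)$ of $\{-1, 1\}^n$, there is a function $g_y$ in $\mathcal{G}$ satisfying

$\forall i \in \{1, \ldots, n\}, \; \begin{cases} \text{if } y_i = 1, & \exists k : \psi^{(i)}(k) = 1 \wedge \Delta^\# g_{y,k}(x_i) - b_i \geq \gamma \\ \text{if } y_i = -1, & \exists l : \psi^{(i)}(l) = -1 \wedge \Delta^\# g_{y,l}(x_i) + b_i \geq \gamma \end{cases}$

The $\gamma$-$\Psi$-dimension of $\Delta^\#\mathcal{G}$, $\Psi$-dim$(\Delta^\#\mathcal{G}, \gamma)$, is the maximal cardinality of a
subset of $\mathcal{X}$ $\gamma$-$\Psi$-shattered by $\Delta^\#\mathcal{G}$, if it is finite, or infinity otherwise.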

The margin Natarajan dimension is defined accordingly. 

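Definition 12 (Natarajan dimension with margin $\gamma$). Let $\mathcal{G}$ be a class
of functions on a set $\mathcal{X}$ taking their values in $\mathbb{R}^Q$. For $\gamma \in \mathbb{R}_+^*$, a subset
$s_{\mathcal{X}^n} = \{x_i : 1 \leq i \leq n\}$ of $\mathcal{X}$ is said to be $\gamma$-N-shattered by $\Delta^\#\mathcal{G}$ if there is
a set $I(s_{\mathcal{X}^n}) = \{(i_1(x_i), i_2(x_i)) : 1 \leq i \leq n\}$ of n pairs of distinct indices in
$\{1, \ldots, Q\}$ and a vector $v_b = (b_i)$ in $\mathbb{R}^n$ such that, for each binary vector
$v_y = (y_i) \in \{-1, 1\}^n$, there is a function $g_y$ in $\mathcal{G}$ satisfying

$\forall i \in \{1, \ldots, n\}, \; \begin{cases} \text{if } y_i = 1, & \Delta^\# g_{y, i_1(x_i)}(x_i) - b_i \geq \gamma \\ \text{if } y_i = -1, & \Delta^\# g_{y, i_2(x_i)}(x_i) + b_i \geq \gamma \end{cases}$

The Natarajan dimension with margin $\gamma$ of the class $\Delta^\#\mathcal{G}$, N-dim$(\Delta^\#\mathcal{G}, \gamma)$,
is the maximal cardinality of a subset of $\mathcal{X}$ $\gamma$-N-shattered by $\Delta^\#\mathcal{G}$, if it is
finite, or infinity otherwise.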
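
A check of this shattering notion for a finite class can be sketched in Python. The sketch assumes that the condition of Definition 12 is the specialisation of Definition 11 to the mappings $\psi_{i_1(x_i), i_2(x_i)}$: the component with index $i_1(x_i)$ must exceed $b_i$ by $\gamma$ when $y_i = 1$, and the component with index $i_2(x_i)$ must exceed $-b_i$ by $\gamma$ when $y_i = -1$. The witnesses (index pairs and offsets) are supplied by the caller, and all names are illustrative.

```python
from itertools import product


def is_margin_n_shattered(margin_functions, points, index_pairs, b, gamma):
    """Check gamma-N-shattering of the points for given witnesses.

    margin_functions: callables mapping a point to its vector of margin values.
    index_pairs: one pair (i1, i2) of distinct 1-based category indices per point.
    b: one real offset per point.
    """
    for v in product((-1, 1), repeat=len(points)):
        realised = any(
            all(
                d(x)[i1 - 1] - bi >= gamma if y == 1 else d(x)[i2 - 1] + bi >= gamma
                for y, x, (i1, i2), bi in zip(v, points, index_pairs, b)
            )
            for d in margin_functions
        )
        if not realised:
            return False
    return True
```

As in the real-valued case, the check is for one fixed choice of witnesses; computing the dimension itself would require a search over index pairs, offsets and subsets of points.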

For this scale-sensitive $\Psi$-dimension, the connection with the covering
numbers of interest, or generalized Sauer lemma, is the following one.
Theorem 2 (Theorem 4 in [Guermeur, 2004]). Let $\mathcal{G}$ be a class of
functions from a domain $\mathcal{X}$ into $\mathbb{R}^Q$. For every value of $\gamma$ in (0, 1] and every
$m \in \mathbb{N}^*$ satisfying $2m \geq$ N-dim$(\Delta_\gamma\mathcal{G}, \gamma/24)$, the following bound is true:

$\mathcal{N}^{(p)}_{\infty,\infty}(\gamma/4, \Delta^*_\gamma\mathcal{G}, 2m) \leq 2 \left( 288 \, m \, Q^2 (Q - 1) \right)^{\lceil d \log_2 (23 e m Q (Q - 1) / d) \rceil} \quad (2)$

where d = N-dim$(\Delta_\gamma\mathcal{G}, \gamma/24)$.

This theorem is the central result of the paper (and the novelty in the revised
version of [Guermeur, 2004]). What makes it a nontrivial Q-class extension
of Lemma 3.5 in [Alon et al., 1997] is the presence of both margin operators.
The reason why $\Delta^*$ appears in the covering number instead of $\Delta$ is the very
principle at the basis of all the variants of Sauer's lemma: two functions
separated with respect to the functional pseudo-metric used (here $d_{\ell_\infty, \ell_\infty(s_{\mathcal{X}^n})}$)
shatter (at least) one point in $s_{\mathcal{X}^n}$. This is true for $\Delta^*_\gamma\mathcal{G}$, or more precisely its
$\eta$-discretization, not for $\Delta_\gamma\mathcal{G}$ (see Section 5.3 in [Guermeur, 2004] for details).
One can derive a variant of Theorem 2 involving N-dim$(\Delta^*_\gamma\mathcal{G}, \gamma/24)$. This
alternative is however of lesser interest, for reasons that will appear below.



4 Margin Natarajan dimension of the M-SVMs 

We now compute an upper bound on the margin Natarajan dimension of
interest when $\mathcal{G}$ is the class of functions computed by the M-SVMs. These large
margin classifiers are built around a Mercer kernel. Let $\kappa$ be such a kernel
on $\mathcal{X}$ and $(H_\kappa, \langle \cdot, \cdot \rangle_{H_\kappa})$ the corresponding reproducing kernel Hilbert space
(RKHS) [Aronszajn, 1950]. Let $\Phi$ be any of the mappings on $\mathcal{X}$ satisfying:

$\forall (x, x') \in \mathcal{X}^2, \; \kappa(x, x') = \langle \Phi(x), \Phi(x') \rangle, \quad (3)$

where $\langle \cdot, \cdot \rangle$ is the dot product of the $\ell_2$ space. "The" feature space
traditionally designates any of the Hilbert spaces $(E_{\Phi(\mathcal{X})}, \langle \cdot, \cdot \rangle)$ spanned by the
$\Phi(\mathcal{X})$. By definition of a RKHS, $\mathcal{H} = ((H_\kappa, \langle \cdot, \cdot \rangle_{H_\kappa}) + \{1\})^Q$ is the class of
functions $h = (h_k)_{1 \leq k \leq Q}$ from $\mathcal{X}$ into $\mathbb{R}^Q$ of the form:

$h(\cdot) = \left( \sum_{i=1}^{l_k} \beta_{ik} \kappa(x_{ik}, \cdot) + b_k \right)_{1 \leq k \leq Q}$

where the $x_{ik}$ are elements of $\mathcal{X}$ (the $\beta_{ik}$ and $b_k$ are scalars), as well as the
limits of these functions when the sets $\{x_{ik} : 1 \leq i \leq l_k\}$ become dense in $\mathcal{X}$
in the norm induced by the dot product. Due to (3), $\mathcal{H}$ can also be seen as
a multivariate affine model on $\Phi(\mathcal{X})$. Functions h can then be rewritten as:

$h(\cdot) = \left( \langle w_k, \cdot \rangle + b_k \right)_{1 \leq k \leq Q}$

where vectors $w_k$ are elements of $E_{\Phi(\mathcal{X})}$. They are thus described by the
pair (w, b) with $w = (w_k)_{1 \leq k \leq Q}$ and $b = (b_k)_{1 \leq k \leq Q}$. Let $\bar{\mathcal{H}}$ stand for the
product space $H_\kappa^Q$. Its norm $\|\cdot\|_{\bar{\mathcal{H}}}$ is given by $\|\bar{h}\|_{\bar{\mathcal{H}}} = \sqrt{\sum_{k=1}^{Q} \|w_k\|^2}$.
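
The equivalence between the kernel expansion and the affine form in feature space can be verified numerically on a small example. The Python sketch below makes several assumptions for the sake of illustration: the kernel is a homogeneous polynomial kernel of degree 2 on $\mathbb{R}^2$, for which an explicit feature map is known, and the support points, coefficients and function names are all hypothetical.

```python
import math


def kernel(x, z):
    """Homogeneous polynomial kernel of degree 2 on R^2 (an illustrative choice)."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def phi(x):
    """Explicit feature map with <phi(x), phi(z)> = kernel(x, z)."""
    return [x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2]


def h_kernel(x, support, beta, b):
    """Kernel expansion: h_k(x) = sum_i beta[k][i] * kernel(support[k][i], x) + b[k]."""
    return [
        sum(coef * kernel(xi, x) for coef, xi in zip(beta[k], support[k])) + b[k]
        for k in range(len(b))
    ]


def weight_vectors(support, beta):
    """Feature-space vectors w_k = sum_i beta[k][i] * phi(support[k][i])."""
    weights = []
    for points, coefs in zip(support, beta):
        w = [0.0, 0.0, 0.0]
        for coef, xi in zip(coefs, points):
            w = [wj + coef * fj for wj, fj in zip(w, phi(xi))]
        weights.append(w)
    return weights


def h_affine(x, weights, b):
    """Affine form: h_k(x) = <w_k, phi(x)> + b_k."""
    fx = phi(x)
    return [sum(wj * fj for wj, fj in zip(w, fx)) + bk for w, bk in zip(weights, b)]
```

Up to floating-point rounding, `h_kernel` and `h_affine` return the same vector of Q outputs, which is the content of the rewriting of h as an affine model on the image of the feature map.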



Definition 13 (M-SVM). An M-SVM is a large margin multi-category
discriminant model obtained by minimizing over the hyperplane $\sum_{k=1}^{Q} h_k = 0$
of $\mathcal{H}$ an objective function of the form:

$J(h) = \sum_{i=1}^{m} \ell_{\text{M-SVM}}(y_i, h(x_i)) + \lambda \|w\|^2$

where the empirical term, used in place of the empirical risk, involves a loss
function $\ell_{\text{M-SVM}}$ which is convex.
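
The shape of this objective can be illustrated with a small Python sketch. It rests on assumptions that are not part of the definition: the model is taken to be affine in the input itself (identity feature map), and the loss is a pairwise hinge chosen here only because it is convex; the text does not single out any particular loss. All names are hypothetical.

```python
def outputs(x, w, b):
    """Affine model: h_k(x) = <w_k, x> + b_k for k = 1..Q."""
    return [sum(wj * xj for wj, xj in zip(wk, x)) + bk for wk, bk in zip(w, b)]


def pairwise_hinge(y, h):
    """Illustrative convex loss: sum over k != y of max(0, 1 - (h_y - h_k))."""
    return sum(
        max(0.0, 1.0 - (h[y - 1] - hk))
        for k, hk in enumerate(h, start=1)
        if k != y
    )


def satisfies_sum_to_zero(w, b, tol=1e-12):
    """Check sum_k w_k = 0 and sum_k b_k = 0, which is sufficient for sum_k h_k = 0."""
    column_sums = [sum(col) for col in zip(*w)]
    return all(abs(s) <= tol for s in column_sums) and abs(sum(b)) <= tol


def objective(sample, w, b, lam):
    """J(h) = sum_i loss(y_i, h(x_i)) + lam * ||w||^2, with ||w||^2 = sum_k ||w_k||^2."""
    data_term = sum(pairwise_hinge(y, outputs(x, w, b)) for x, y in sample)
    penalty = sum(wj * wj for wk in w for wj in wk)
    return data_term + lam * penalty
```

Replacing `pairwise_hinge` by another convex loss changes the machine but not the structure of the objective: a data-fitting term plus the penalizer on the norm of w, minimized under the sum-to-zero constraint.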

The M-SVMs only differ in the nature of $\ell_{\text{M-SVM}}$. The specification of this
function is such that the introduction of the penalizer $\|w\|^2$ tends to
maximize a notion of margin directly connected with the one of Definition 1. The
formulation of the generalized Sauer lemma provided here (Theorem 2) is
the one obtained under the weakest hypotheses. Proceeding as in the bi-class
case, we express below a bound on the margin Natarajan dimension of the
M-SVMs as a function of the volume occupied by data in $E_{\Phi(\mathcal{X})}$ and
constraints on (w, b), thus restricting the study to functions with a well-defined
range. In that case, a variant of Theorem 2 can be derived from Lemma 7 in
[Guermeur, 2004] which does not involve $\pi_\gamma$ but relates the covering numbers
of $\Delta^*\mathcal{G}$ to the margin Natarajan dimension of $\Delta\mathcal{G}$. Its use for M-SVMs is
advantageous since N-dim$(\Delta\mathcal{H}, \epsilon)$ is easier to bound than N-dim$(\Delta_\gamma\mathcal{H}, \epsilon)$
(nonlinearity is difficult to handle). This change of generalized Sauer lemma
calls for the use of an intermediate formula relating the covering numbers of
$\Delta^*_\gamma\mathcal{H}$ and $\Delta^*\bar{\mathcal{H}}$. It is provided by the following lemma.

Lemma 1 (Lemmas 9 and 10 in [Guermeur, 2004]). Let $\mathcal{H}$ be the class
of functions that a Q-category M-SVM can implement under the hypothesis
$b \in [-\beta, \beta]^Q$. Let $(\gamma, \epsilon) \in \mathbb{R}^2$ satisfy $0 < \epsilon \leq \gamma \leq 1$. Then



JV/£k(e,4;w,m)<(2 t +lj ^(eA^m). (4) 

A final theorem then completes the construction of the guaranteed risk. 

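Theorem 3 (Theorem 5 in [Guermeur, 2004]). Let $\bar{\mathcal{H}}$ be the class of
functions that a Q-category M-SVM can implement under the hypothesis that
$\Phi(\mathcal{X})$ is included in the closed ball of radius $\Lambda_{\Phi(\mathcal{X})}$ about the origin in $E_{\Phi(\mathcal{X})}$
and the constraints $\frac{1}{2} \max_{1 \leq k < l \leq Q} \|w_k - w_l\| \leq \Lambda_w$ and b = 0. Then, for
any positive real value $\epsilon$, the following bound holds true:

N-dim$(\Delta\bar{\mathcal{H}}, \epsilon) \leq C_Q^2 \left( \frac{\Lambda_w \Lambda_{\Phi(\mathcal{X})}}{\epsilon} \right)^2. \quad (5)$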
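
To get a feeling for the orders of magnitude involved, the right-hand side of (5) can be evaluated with a one-line Python function. The sketch assumes that the bound reads $C_Q^2 (\Lambda_w \Lambda_{\Phi(\mathcal{X})} / \epsilon)^2$, with $C_Q^2$ the binomial coefficient "2 among Q"; the function and parameter names are illustrative.

```python
from math import comb


def natarajan_bound(Q, lambda_w, lambda_phi, eps):
    """Evaluate C(Q, 2) * (lambda_w * lambda_phi / eps) ** 2."""
    return comb(Q, 2) * (lambda_w * lambda_phi / eps) ** 2
```

Under this reading, the bound grows quadratically with the number of categories (through the number of pairs of categories) and with the inverse of the scale: halving `eps` multiplies it by four.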
The proof follows the line of argument of the corresponding bi-class result,
Theorem 4.6 in [Bartlett and Shawe-Taylor, 1999]. This involves a
generalization of Lemma 4.2 which can only be performed for the $\Delta$ operator. The
discussion on the presence of both $\Delta$ and $\Delta^*$ in Theorem 2 is thus completed.
Putting things together, the control term of the guaranteed risk decreases
with the size of the training sample as $\ln(m) \cdot m^{-1/2}$. This represents an
improvement over the rate obtained in [Guermeur et al., 2005], $m^{-1/4}$.
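
The two rates can be compared numerically with a short Python sketch. Constants are ignored, so the figures only indicate asymptotic behaviour; the function names are assumptions made for the example.

```python
import math


def new_rate(m):
    """Rate ln(m) * m^(-1/2), constants ignored."""
    return math.log(m) / math.sqrt(m)


def old_rate(m):
    """Rate m^(-1/4), constants ignored."""
    return m ** -0.25


def rate_ratio(m):
    """Ratio new/old = ln(m) * m^(-1/4); it tends to 0 as m grows."""
    return new_rate(m) / old_rate(m)
```

With constants set aside, the logarithmic factor makes the new rate the larger of the two for moderate sample sizes (for instance m = 100), while it is clearly smaller for large ones (for instance m = 10^6), since the ratio behaves as $\ln(m) \cdot m^{-1/4}$.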

5 Conclusions and future work 

A new class of generalized VC dimensions dedicated to large margin multi- 
category discriminant models has been introduced. They can be seen either 
as multivariate extensions of the fat-shattering dimension or scale-sensitive 
$\Psi$-dimensions. Their finiteness (for all positive values of the scale parameter
$\gamma$) is also a necessary and sufficient condition for learnability. A generalized
Sauer lemma has been provided for one of these capacity measures,
the margin Natarajan dimension. This latter dimension has been bounded 
from above in the case where the classifier is a multi-class SVM. This study 
provides us with new arguments to support the thesis that the theory of 
multi-category pattern recognition cannot be developed by extending in a 
straightforward way bi-class results. We are currently making use of the 
specificities identified here to extend new concentration inequalities to the 
multi-class case with the goal to obtain improved convergence rates. 